Date : 7/27/2020 1:33:33 PM 

From : "Cisneros, Gerardo" Gerardo.Cisneros@unt.edu 

To : "Yin (Whitney), Yuhui W." ywyin@UTMB.EDU, "Beuning, Penny" 
P.Beuning@northeastern.edu, "Luis Gabriel Brieba de Castro" 
luis.brieba@cinvestav.mx 

Subject : Re: [EXT] Re: grant? 

Attachment : 20-253591.pdf; 


WARNING: This email originated from outside of UTMB's email system. Do not click links or open 
attachments unless you recognize the sender and know the content is safe. 


That is great! I am excited to work on this proposal with you and everyone else. 


I agree about figuring out which mutants. I proposed a conversation this 
afternoon, but there was no response from Luis or Penny. 


I've spent the day looking at more possible mutants. I found the following paper 
that talks about two other RDRP mutants, A406V and N874V, in addition to P323L: 


https://www.biorxiv.org/content/10.1101/2020.04.26.062471v2.full 


V553I, M611F, A613Y, A621G, Y649H and K794R, of which two could be 
expressed in complex with nsp14 (exo-), were reported in this paper: 


https://jvi.asm.org/content/90/16/7415 


I also found this paper from the WHO with a nice list (Table 3) of mutations 
(attached), which reports one missense mutation on nsp7: S25L (among others). 


Based on the above, I suggest picking mutants on RDRP and the one reported on 
nsp7 for the proposal. 


I will send another paper with initial forms for subcontracts, etc. If everyone is OK, 
I can serve as contact PI. I will also have Mr. Shawn Adams contact everyone to 
start coordinating the admin part of the proposal. 


Can you please send me a proposed budget at your earliest convenience? 


I'd suggest writing the proposal using a shared Dropbox or Google Drive folder. Do 
you have a preference? 


Best, 
Andrés 


G. Andrés Cisneros, Ph.D 
Professor 


Department of Chemistry 

Center for Adv. Scientific Computing and Modeling (CASCaM) 
University of North Texas 
http://chemistry.unt.edu/~CisnerosResearch 
andres@unt.edu 

he/him/his 


From: Yin (Whitney), Yuhui W. <ywyin@UTMB.EDU> 
Sent: Monday, July 27, 2020 12:48 PM 

To: Cisneros, Gerardo <Gerardo.Cisneros@unt.edu> 
Subject: [EXT] Re: grant? 


Hi Andres, 


I would be happy to participate. 
Regarding the content of the grant, we do need to figure out what mutants 
are potentially interesting and our confidence in the existing data. 


Whitney 


From: Cisneros, Gerardo <Gerardo.Cisneros@unt.edu> 
Sent: Monday, July 27, 2020 11:30 AM 

To: Yin (Whitney), Yuhui W. <ywyin@UTMB.EDU> 
Subject: grant? 


Hi Whitney, 


I wanted to touch base with you about the mutations and whether you are on 
board to write the grant. Given the very short time frame, I would appreciate it if 
you could let me know today whether you are interested in participating in the proposal. I 
need to let my res. office know how we will work on the grant, lead 
institution, etc., and need to get the documents prepared. 


Thanks so much in advance, 
Andrés 


G. Andrés Cisneros, Ph.D 

Professor 

Department of Chemistry 

Center for Adv. Scientific Computing and Modeling (CASCaM) 
University of North Texas 
http://chemistry.unt.edu/~CisnerosResearch 
andres@unt.edu 

he/him/his 

Variant analysis of SARS-CoV-2 genomes 


Takahiko Koyama, Daniel Platt & Laxmi Parida 


Abstracts in Arabic, Chinese, French, Russian and Spanish at the end of each article. 


Introduction 


In late 2019, several people in Wuhan, China, presented 
with severe pneumonia at the hospitals. As the number of 
patients rapidly increased, the Chinese government decided 
on 23 January 2020 to lock down the city to contain the virus. 
Unfortunately, the virus had already spread across China and 
throughout the world. The World Health Organization (WHO) 
officially declared the outbreak a pandemic on 11 March 2020. 
As of 23 May 2020, over 5 million cases worldwide had been 
reported to WHO and the death toll had exceeded 330 000. 

Researchers isolated the virus causing the pneumonia in 
December 2019 and found it to be a strain of β-coronavirus 
(CoV). The virus showed a high nucleotide sequence homology 
with two severe acute respiratory syndrome (SARS)-like 
bat coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21 
(88% homology) and with SARS-CoV (79.5% homology), 
but only 50% homology with the Middle East respiratory 
syndrome coronavirus (MERS-CoV). The virus, now named 
SARS-CoV-2, contains a single positive stranded RNA 
(ribonucleic acid) of 30 kilobases, which encodes 10 genes. 
Researchers have shown that the virus can enter cells by 
binding the angiotensin-converting enzyme 2 (ACE2) through its 
receptor binding domain in the spike protein. 

The virus causes the coronavirus disease 2019 (COVID-19), 
with common symptoms such as fever, cough, 
shortness of breath and fatigue. Early data indicated that 
about 20% of patients develop severe COVID-19 requiring 
hospitalization, including 5% who are admitted to the intensive 
care unit. Initial estimates of the case fatality rates were from 
3.4% to 6.6%, which is lower than that of SARS or MERS, 9.6% 
and 34.3% respectively. The mortality from COVID-19 is 
higher in people older than 65 years and in people with 
underlying comorbidities, such as chronic lung disease, serious 
heart conditions, high blood pressure, obesity and diabetes. 


Community transmission of the virus, as well as anti-viral 
treatments, can engender novel mutations in the virus, poten- 
tially resulting in more virulent strains with higher mortality 
rates or emergence of strains resistant to treatment. There- 
fore, systematic tracking of demographic and clinical patient 
information, as well as strain information is indispensable to 
effectively combat COVID-19. 

Here we analysed the SARS-CoV-2 genome from 10 022 
samples to understand the variability in the viral genome 
landscape and to identify emerging clades. 


Methods 


In total, we downloaded 15755 genome sequences from the 
following databases: the Chinese National Microbiology Data 
Center on 1 February 2020; the Chinese National Genomics 
Data Center Genome Warehouse on 4 February 2020; 
GISAID on 1 May 2020 and GenBank on 1 May 2020. We 
removed redundant sequences with the China National Center 
for Bioinformation annotations. To reduce the number of false 
positive variants, we removed sequences with more than 50 
ambiguous bases. 

For this study, we used the sequence of the established 
SARS-CoV-2 reference genome, NC_045512. This genome was 
sequenced in December 2019. Each sample was first aligned 
to the reference genome in a pairwise manner using EMBOSS 
needle (Hinxton, Cambridge, England), with a default gap 
penalty of 10 and extension penalty of 0.5. Then, we developed 
a custom script in Python (Python Software Foundation, 
Wilmington, United States of America) to extract the differences 
between the genome variants and the reference genome. 
Nucleotide variants in the coding regions were converted to 
corresponding encoded amino acid residues. For the open 
reading frame 1 (ORF1), we used the protein coordinates 
from YP_009724389.1 for translation. Finally, we carefully 
investigated stop-gained and frameshift variants causing 
deletions and insertions to detect potential artefacts caused by 
undetermined or ambiguous bases. The results are provided in 
a list of variants (available in the data repository). 
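
The difference-extraction step can be illustrated with a minimal sketch. This is not the custom script used in the study, which is not reproduced here. It assumes the two rows of one pairwise alignment are already available as equal-length strings with '-' for gaps, and it reports each differing column separately rather than merging adjacent gap columns into one deletion or insertion. The function name extract_variants is hypothetical.

```python
from typing import List, Tuple


def extract_variants(ref_aln: str, sample_aln: str) -> List[Tuple[int, str, str]]:
    """List differences between two rows of one pairwise alignment.

    Returns (reference_position, reference_base, sample_base) tuples.
    Positions are 1-based and count reference bases only, so an insertion
    is reported at the position of the reference base that precedes it.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for ref_base, sample_base in zip(ref_aln.upper(), sample_aln.upper()):
        if ref_base != "-":
            ref_pos += 1
        # Skip matches, and skip undetermined bases (N) so they are not
        # mistaken for real variants.
        if ref_base == sample_base or sample_base == "N":
            continue
        variants.append((ref_pos, ref_base, sample_base))
    return variants


if __name__ == "__main__":
    # One substitution, one inserted base and one deleted base.
    for variant in extract_variants("ACGT-ACGT", "ATGTGAC-T"):
        print(variant)
```

Skipping columns where the sample has an N mirrors the concern above about artefacts from undetermined bases; a fuller implementation would also handle the other ambiguity codes.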

Using the identified recurrent variants, 
we performed hierarchical clustering 
in the SciPy library, Python, to identify 
clades. First, a binary matrix of samples 
and distinct variants was created. Then, 
we did hierarchical clustering using 
Ward's method in the SciPy library. 
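
A minimal sketch of the first step, building the binary sample-by-variant matrix, is given below using only the standard library. The input format (a mapping from sample name to variant labels), the function name and the recurrence threshold of 2 are assumptions made for illustration; the study's own threshold for "recurrent" is not stated here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_binary_matrix(
    calls: Dict[str, Iterable[str]], min_recurrence: int = 2
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (samples, variants, matrix) with matrix[i][j] = 1 when
    sample i carries variant j. Only variants seen in at least
    min_recurrence samples (a hypothetical cut-off) become columns."""
    per_sample = {sample: set(variants) for sample, variants in calls.items()}
    counts = Counter(v for variants in per_sample.values() for v in variants)
    columns = sorted(v for v, n in counts.items() if n >= min_recurrence)
    samples = sorted(per_sample)
    matrix = [[int(v in per_sample[s]) for v in columns] for s in samples]
    return samples, columns, matrix


if __name__ == "__main__":
    toy_calls = {
        "s1": ["241C>T", "3037C>T"],
        "s2": ["3037C>T", "28144T>C"],
        "s3": ["241C>T", "3037C>T", "11083G>T"],
    }
    samples, columns, matrix = build_binary_matrix(toy_calls)
    print(columns)
    for name, row in zip(samples, matrix):
        print(name, row)
```

With SciPy installed, the resulting matrix can be passed to scipy.cluster.hierarchy.linkage with method="ward" to obtain the hierarchical clustering.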

We investigated the mutation patterns 
of SARS-CoV-2 to find potential 
causes of mutations, by looking at the 
changes in bases. Since coronavirus genomes 
are positive sense, single stranded 
RNA, we did not combine C > T with 
G > A mutations. 
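
The point about not combining C > T with G > A can be made concrete with a short sketch that counts each directed base change on the positive strand separately. The function name and input format are illustrative assumptions, not part of the study's code.

```python
from collections import Counter
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}


def count_base_changes(substitutions: Iterable[Tuple[str, str]]) -> Counter:
    """Count directed single-base changes such as 'C>T'.

    Complementary changes (for example C>T and G>A) are kept as separate
    keys because the genome is single stranded, so the two are not
    equivalent observations of one event.
    """
    counts = Counter()
    for ref, alt in substitutions:
        ref, alt = ref.upper(), alt.upper()
        if ref in BASES and alt in BASES and ref != alt:
            counts[f"{ref}>{alt}"] += 1
    return counts


if __name__ == "__main__":
    observed = [("C", "T"), ("G", "A"), ("c", "t"), ("N", "A")]
    print(count_base_changes(observed))
```

For double-stranded DNA genomes it is common to fold each change onto its complement, which would halve the number of categories; here all 12 directed changes remain distinct.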

The spike protein is a key protein 
for SARS-CoV-2 viral entry and a target 
for vaccine development. We, therefore, 
wanted to find amino acid conservation 
between other coronavirus sequences in 
the spike protein. We used the basic local 
alignment search tool BLAST (National 
Center for Biotechnology Information 
[NCBI], Bethesda, United States) 
followed by the constraint-based multiple 
alignment tool COBALT (NCBI, 
Bethesda, United States). We carefully 
investigated mutations within the receptor 
binding domain and predicted B-cell 
epitopes. The mutations were further 
analysed to identify cross species conservation 
and to understand the nature 
of amino acid changes. We visualized the 
aligned sequence using the open source 
software alv. 

For the phylogenetic analysis, we 
used the open source software Bayesian 
evolutionary analysis by sampling trees 
(BEAST), version 2.5. BEAST uses a 
Bayesian Monte Carlo algorithm 
generating a distribution of likely phylogenies 
given a set of priors, based on the 
probabilities of those tree configurations 
determined from the viral genomes. This 
analysis presents a different view than 
the variant analysis described above and 
is an independent test of the structure 
that individual haplogroup markers 
identify. First, we aligned sequences 
to NC_045512, using the multiple sequence 
alignment software, MAFFT. 
Subsequently, we adjusted for length 
and sequencing errors by truncating 
the bases in the 5'-UTR and 3'-UTR, 
without losing key sites. We excluded 
sequences showing a variability higher 
than 30 bases. For an optimal output of 
the phylogenetic tree, we randomly selected 
a subset of 2000 samples by using 
a random number generator in Python. 
We ran BEAST using sample collection 
dates with the Hasegawa-Kishino-Yano 
mutation model with the strict clock 
mode. Finally, we estimated the mutation 
rate and median tree height from 
the resulting BEAST trees. 
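
The random selection of 2000 samples could look like the sketch below. The fixed seed and the sorting of identifiers are assumptions added here so that the draw is reproducible; the text does not say whether a seed was used. The sample identifiers are placeholders.

```python
import random
from typing import Iterable, List


def subsample(sample_ids: Iterable[str], k: int = 2000, seed: int = 0) -> List[str]:
    """Draw k distinct sample identifiers without replacement.

    Identifiers are sorted first so that the same seed gives the same
    subset regardless of input order. If fewer than k identifiers are
    available, all of them are returned.
    """
    ids = sorted(set(sample_ids))
    if k >= len(ids):
        return ids
    rng = random.Random(seed)  # local generator, leaves global state alone
    return sorted(rng.sample(ids, k))


if __name__ == "__main__":
    all_ids = [f"sample_{i:05d}" for i in range(10022)]  # placeholder names
    chosen = subsample(all_ids)
    print(len(chosen), chosen[:3])
```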


Results 


In total, we analysed 10022 SARS-CoV-2 
genomes (sequences are available from 
the data repository) from 68 countries. 
Most genomes came from the United 
States of America (3543 samples), followed 
by the United Kingdom of Great 
Britain and Northern Ireland (1987 samples) 
and Australia (760 samples; Box 1). 
We detected in total 65776 variants with 
5775 distinct variants. The 5775 distinct 
variants consist of 2969 missense mutations, 
1965 synonymous mutations, 484 
mutations in the non-coding regions, 
142 non-coding deletions, 100 in-frame 
deletions, 66 non-coding insertions, 
36 stop-gained variants, 11 frameshift 
deletions and two in-frame insertions 
(Table 1). 
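
As a quick consistency check, the nine categories listed above can be summed to confirm that they account for all 5775 distinct variants. The dictionary keys are shorthand labels chosen for this sketch.

```python
# Distinct variant counts by category, as listed in the text above.
DISTINCT_VARIANTS = {
    "missense": 2969,
    "synonymous": 1965,
    "non-coding": 484,
    "non-coding deletion": 142,
    "in-frame deletion": 100,
    "non-coding insertion": 66,
    "stop-gained": 36,
    "frameshift deletion": 11,
    "in-frame insertion": 2,
}

total = sum(DISTINCT_VARIANTS.values())
missense_share = DISTINCT_VARIANTS["missense"] / total

if __name__ == "__main__":
    print(total)
    print(f"missense share: {missense_share:.1%}")
```

The categories add up exactly to 5775, and missense mutations make up a little over half of the distinct variants.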

Of the 2969 missense variants, 
1905 variants are found in ORF1ab, 
which is the longest ORF, occupying two 
thirds of the entire genome. ORF1ab 
is translated into a polyprotein and 
subsequently cleaved into 16 nonstructural 
proteins (NSPs). Of these proteins, 
NSP3 has the largest number of missense 
variants among ORF1ab proteins. Of 
the NSP3 missense variants, A58T was 
the most common (159 samples), followed 
by P153L (101 samples; Table 2). 
We also detected mutations in the 
nonstructural protein RNA-dependent 
RNA polymerase (RdRp), such as P323L 
(6319 samples). Deletions are also common 
in 3'-5' exonuclease (11 deletions), 
including those resulting in frameshifts. 
A comprehensive list of variants is available 
in the data repository. 

Variants with recurrence over 100 
samples are shown in Table 3. The most 
common variants were the synonymous 
variant 3037C > T (6334 samples), 
ORF1ab P4715L (RdRp P323L; 6319 samples) 
and S D614G (6294 samples). They 
occur simultaneously in over 3000 samples, 
mainly from Europe and the United 
States. Other variants including ORF3a 
Q57H (2893 samples), ORF1ab T265I 
(NSP3 T85I; 2442 samples), ORF8 L84S 
(1669 samples), N203_204delinsKR 
(1573 samples) and ORF1ab L3606F (NSP6 
L37F; 1070 samples) were the key variants 
for identifying clades. 
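
Several variants are given in two coordinate systems: the position in the ORF1ab polyprotein and the position in the cleaved protein. The sketch below converts between them using offsets inferred from paired labels that appear in this article (P4715L and RdRp P323L, L3606F and NSP6 L37F, Y4379* and NSP10 Y126*). The offsets are derived here for illustration and were not taken from the YP_009724389.1 annotation directly; the function name is hypothetical.

```python
import re

# Offset = polyprotein position minus cleaved-protein position,
# inferred from label pairs quoted in the article (an assumption of
# this sketch, not an authoritative annotation).
OFFSETS = {
    "RdRp": 4715 - 323,
    "NSP6": 3606 - 37,
    "NSP10": 4379 - 126,
}


def relabel(orf1ab_variant: str, protein: str) -> str:
    """Rewrite an ORF1ab variant label in cleaved-protein coordinates."""
    match = re.fullmatch(r"([A-Z])(\d+)(.+)", orf1ab_variant)
    if match is None:
        raise ValueError(f"unrecognised variant label: {orf1ab_variant}")
    residue, position, change = match.groups()
    local_position = int(position) - OFFSETS[protein]
    if local_position < 1:
        raise ValueError(f"{orf1ab_variant} lies upstream of {protein}")
    return f"{protein} {residue}{local_position}{change}"


if __name__ == "__main__":
    print(relabel("P4715L", "RdRp"))
    print(relabel("L3606F", "NSP6"))
    print(relabel("Y4379*", "NSP10"))
```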

We identified six major clades with 
14 subclades (Fig. 1 and Table 4). The 
largest clade is the D614G clade with five 
subclades. Most samples in the D614G 
clade also display the non-coding variant 
241C > T, the synonymous variant 
3037C > T and ORF1ab P4715L. Within 
the D614G clade, the D614G/Q57H/T265I 
subclade forms the largest subclade 
with 2391 samples. The second largest 
major clade is the L84S clade, which was 
observed among travellers from Wuhan 
in the early days of the outbreak, and 
the clade consists of 1662 samples with 
2 subclades. The L84S/P5828L/ subclade 
is predominantly observed in the United 
States. Among the L3606F subclades, 
L3606F/G251V/ forms the largest group 
with 419 samples. G251V frequently 
appears in samples from the United 
Kingdom (329 samples), Australia (95 
samples), the United States (80 samples) 
and Iceland (76 samples). However, the 
basal clade now accounts only for a 
small fraction of genomes (670 samples 
mainly from China). The remaining two 
clades, D448del and G392D, are small 
and they are without any significant 
subclades at this point. 

All non-coding deletions are 
located within the 3'-UTR, the 5'-UTR or intergenic 
regions. Of the in-frame deletions, 
ORF1 D448del stands out with 250 samples. 
In contrast, we only detected two 
distinct in-frame insertions in our data 
set. We also detected 11 frameshift deletions 
and 36 stop-gained variants. The 
recurrent stop-gained variant Y4379* 
(NSP10 Y126*) is found in 51 samples 
in the D614G clade. NSP10 Y126* is 
located only 13 residues upstream of 
the stop codon; therefore, a truncation 
may not significantly affect function of 
the protein. Most of the frameshift variants 
in ORF1ab do not recur except for 
S135fs (three samples) and L3606fs (two 
samples). Although frameshift variants 
are considered deleterious, for instance, 
S135fs (more precisely S135Rfs*9) 


Table 1. Number of gene variants in SARS-CoV-2 genomes, 2019-2020 


E: envelope protein; M: membrane glycoprotein; N: nucleocapsid phosphoprotein; ORF: open reading frame; S: spike glycoprotein; SARS-CoV-2: severe acute respiratory 
syndrome coronavirus 2; UTR: untranslated region. 
a Genes are in italics. 
Note: We compared 10022 genomes to the NC_045512 genome sequence. 


Table 2. Number of variants in the open reading frame 1ab of SARS-CoV-2 genomes, by final cleaved protein, 2019-2020 


3CLPro: 3C like protease; ExoN: 3'-5' exonuclease; NSP: non-structural protein; OMT: O-methyltransferase; RdRp: RNA-dependent RNA polymerase; SARS-CoV-2: severe 
acute respiratory syndrome coronavirus 2. 
a The open reading frame 1ab gene codes for a polyprotein, which a viral protease cleaves into several proteins after translation. 
Note: We compared 10022 genomes to the NC_045512 genome sequence. 


Table 3. Variants of SARS-CoV-2 genomes observed in more than 100 samples, 2019-2020 


del: deletion; delins: deletion-insertion; ExoN: 3'-5' exonuclease; NSP: non-structural protein; M: membrane glycoprotein; N: nucleocapsid phosphoprotein; NA: not 
applicable; ORF: open reading frame; RdRp: RNA-dependent RNA polymerase; SARS-CoV-2: severe acute respiratory syndrome coronavirus 2; S: spike glycoprotein; UTR: 
untranslated region. 
Note: We compared 10022 genomes to the NC_045512 genome sequence. 


Bull World Health Organ 2020;98:495-504 | doi: http://dx.doi.org/10.2471/BLT.20.253591 


Takahiko Koyama et al. 


Research 


Severe acute respiratory syndrome coronavirus 2 genomes 


Fig. 1. A graphical representation of variants found in SARS-CoV-2 genomes, 2019-2020 

3CLPro: 3C like protease; del: deletion; delins: deletion-insertion; E: envelope protein; ExoN: 3'-5' exonuclease; M: membrane glycoprotein; N: nucleocapsid 
phosphoprotein; NA: not applicable; NSP: non-structural protein; OMT: O-methyltransferase; ORF: open reading frame; RdRp: RNA-dependent RNA polymerase; 
SARS-CoV-2: severe acute respiratory syndrome coronavirus 2; S: spike glycoprotein; UTR: untranslated region. 

Notes: Variants are coloured depending on the type of mutations (missense, synonymous, non-coding, stop-gained, and frameshift). Major variants are annotated, 
and clades are indicated by horizontal colour stripes. Continents and countries from where samples originated are shown in the bars on the left. The gene 
structure is displayed at the bottom. Countries with samples in the African continent: Algeria, Democratic Republic of the Congo, Egypt, Gambia, Senegal and 
South Africa; Asian continent: Cambodia, China, Georgia, India, Iran (Islamic Republic of), Israel, Japan, Jordan, Kuwait, Malaysia, Nepal, Pakistan, Philippines, Qatar, 
Republic of Korea, Saudi Arabia, Singapore, Sri Lanka, Thailand and Viet Nam; European continent: Austria, Belarus, Belgium, Czechia, Denmark, Estonia, Finland, 
France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Latvia, Lithuania, Luxembourg, Netherlands, Norway, Poland, Portugal, Slovakia, Slovenia, Spain, Sweden, 
Switzerland, Russian Federation, Turkey and United Kingdom; North America: Canada, Mexico and United States; Oceania: Australia and New Zealand; South 
America: Argentina, Brazil, Chile, Colombia, Costa Rica and Peru. 


Table 4. Major clades of SARS-CoV-2 genomes, 2019-2020 


Del: deletion; delins: deletion-insertion; SARS-CoV-2: severe acute respiratory syndrome coronavirus 2. 
a The reference genome (NC_045512) used in this study belongs to the basal clade. 

caused by 670_671del, ORF1ab is 
truncated at residue 143 before NSP2 
and translation might resume from the 
methionine at residue 174 near the end 
of NSP1. Other notable recurrent frameshift 
variants include ORF3a V256fs and 
ORF7 I103fs. 


Fig. 2. Base pair changes observed in SARS-CoV-2 genomes, 2019-2020 

SARS-CoV-2: severe acute respiratory syndrome 
coronavirus 2. 

Notes: The data come from 10 022 analysed 
genomes. The arrows indicate how bases are 
changed, with separate arrow styles for transitions 
and transversions. Numbers next to the arrows indicate 
the number of distinct variants with those types 
of changes. 

The most common base change is C 
> T (Fig. 2). As expected, we observed a 
strong bias in the transition versus transversion 
ratio (7:3). C > T transitions might 
be mediated by cytosine deaminases. 
Surprisingly, G > T transversions, likely 
introduced by oxo-guanine from reactive 
oxygen species, were also frequently 
observed. 
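
The transition versus transversion distinction can be expressed in a few lines: a change within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any change between the two groups is a transversion. The sketch below is illustrative only; the function names and the toy input are assumptions.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a base substitution: {ref}>{alt}")
    same_group = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_group else "transversion"


def count_ts_tv(changes: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Count (transitions, transversions) in a list of (ref, alt) pairs."""
    kinds = [classify(ref, alt) for ref, alt in changes]
    return kinds.count("transition"), kinds.count("transversion")


if __name__ == "__main__":
    print(classify("C", "T"))  # pyrimidine to pyrimidine
    print(classify("G", "T"))  # purine to pyrimidine
    print(count_ts_tv([("C", "T")] * 7 + [("G", "T")] * 3))
```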

Assessing variants in the spike 
protein revealed 427 distinct non-syn- 
onymous variants with many variants 
located within the receptor binding 
domain and B-cell epitopes (Fig. 3). 
Among the variants in the receptor 
binding domain, V483A (26 samples), 
G476S (9 samples) and V367F (12 
samples) are highly recurrent. 

Fig. 4 shows the consensus tree 
from the phylogenetic analysis. The tree 
has a coalescence centre with exponential 
expansion identified by haplotype 
markers. The colour mapped phylogenies 
largely support the 14 identified 
subclades. We note that substantial 
numbers of samples from the United 
States show affinity with European lineages 
rather than those directly derived 
from East Asia. Except for the earliest 
cases, European clades dominate even 
in samples from western states in the 
United States. Further, European samples 
tend to associate with lineages that 
expanded through Australia. 

Estimation of mutation rate showed 
a median of 1.12 × 10⁻³ mutations per 
site-year (95% confidence interval, CI: 
9.86 × 10⁻⁴ to 1.85 × 10⁻⁴). The median 
tree height was 5.1 months (95% CI: 
4.8 to 5.52). 
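
To give a feel for the scale of this estimate, the per-site rate can be converted into an expected number of substitutions per genome. The sketch below takes the median rate as 1.12 × 10⁻³ mutations per site-year and uses the approximate genome length of 30 kilobases quoted in the Introduction; treating the rate as uniform across sites is a simplifying assumption.

```python
RATE_PER_SITE_YEAR = 1.12e-3   # median estimate quoted in the text
GENOME_LENGTH = 30_000         # approximate length in bases (Introduction)
TREE_HEIGHT_MONTHS = 5.1       # median tree height quoted in the text


def expected_substitutions(rate: float, length: int, years: float) -> float:
    """Expected substitutions per genome, assuming a uniform rate."""
    return rate * length * years


per_genome_per_year = expected_substitutions(
    RATE_PER_SITE_YEAR, GENOME_LENGTH, 1.0
)
over_tree_height = expected_substitutions(
    RATE_PER_SITE_YEAR, GENOME_LENGTH, TREE_HEIGHT_MONTHS / 12
)

if __name__ == "__main__":
    print(f"per genome per year: {per_genome_per_year:.1f}")
    print(f"over the tree height: {over_tree_height:.1f}")
```

Under these assumptions a lineage would accumulate roughly 34 substitutions per year, or about 14 over the estimated tree height.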


Discussion 


Here we show the evolution of the 
SARS-CoV-2 genome as it has spread across 
the world. Although our methods do 
not allow us to investigate whether the 
mutations observed led to a loss or gain 
of function, we can speculate on the 
implications of these mutations for 
viral function. 

The most common clade identified 
was the D614G variant, which 
is located in a B-cell epitope with a 
highly immunodominant region and 
may therefore affect vaccine effectiveness. 
Although amino acids are quite 
conserved in this epitope, we identified 
14 other variants besides D614G. Almost 
all strains with the D614G mutation 
also have a mutation in the protein 
responsible for replication (ORF1ab 
P4715L; RdRp P323L), which might affect 
the replication speed of the virus. This 
protein is the target of the antiviral 
drugs remdesivir and favipiravir, and 
the susceptibility for mutations suggests 
that treatment-resistant strains 
may emerge quickly. Mutations in the 
receptor binding domain of the spike 
protein suggest that these variants are 
unlikely to reduce binding affinity with 
ACE2, since that would decrease the fitness 
of the virus. V483A and G476S are 
primarily observed in samples from the 
United States, whereas V367F is found 
in samples from China, Hong Kong 
Special Administrative Region, France 
and the Netherlands. The V367F and 
D364Y variants have been reported to 
enhance the structural stability of the 
spike protein, facilitating more efficient 
binding to the ACE2 receptor. In summary, 
structural and functional changes 
concomitant with spike protein mutations 
should be meticulously studied 
during therapy design and development. 


Fig. 3. Annotation of SARS-CoV-2 variants in the alignment of the amino acid sequence of the spike protein from several coronaviruses, 
2019-2020 

RBD: receptor binding domain; SARS-CoV-2: severe acute respiratory syndrome coronavirus 2. 
Notes: We aligned amino acid sequences of the spike protein from SARS-CoV-2 (YP_009724390.1), Bat CoV RaTG13 (QHR63300.2), Bat SARS-like CoVs (AVP78042.1, 
AVP78031.1, ATO98205.1 and ATO98157.1) and SARS-like CoV WIV16 (ALK02457.1). Receptor binding domain and predicted B-cell epitopes are highlighted and 
the variants we identified in those segments are marked. The colour coding for the amino acids is by amino acid characteristic. 


Fig. 4. Phylogenetic tree for the SARS-CoV-2 genomes, 2019-2020 

SARS-CoV-2: severe acute respiratory syndrome coronavirus 2. 
Notes: Each sample is coloured with corresponding subclade. We used the Bayesian evolutionary analysis by sampling trees software. 

We detected several non-recurring 
frameshift variants, which can be sequencing 
artefacts. The frameshift at 
Y3 in ORF10, although only detected 
in one sample, might not be essential for 
survival of the new coronavirus, since 
ORF10, a short 38-residue peptide, is 
not homologous with other proteins in 
the NCBI repository. 

The phylogenetic analysis suggests 
population structuring in the evolution 
of SARS-CoV-2. The analysis provides 
an independent test of the major clades 
we identified, as well as the geographic 
expansions of the variants. While the 
earliest samples from the United States 
appear to be derived from China, belonging 
either to the basal or L84S clades, 
the European clades, such as D614G/ 
Q57H, tend to associate with most of 
the subsequent increase in infected 
people in the United States. D614G was 
first observed in late January in China 
and became the largest clade in three 
months. The mutation rate of 1.12 × 10⁻³ 
mutations per site-year is similar to 
0.80 × 10⁻³ to 2.38 × 10⁻³ mutations per 
site-year reported for SARS-CoV-1. 

The rapid increase of infected 
people will provide more genome 
samples that could offer further insights 
into the viral dissemination, particularly 
the possibility of at least two zoonotic 
transmissions of SARS-CoV-2 into the 
human population. An understanding of 
the biological reservoirs carrying coronaviruses 
and the modalities of contact 
with the human population through trade, 
travel or recreation will be important 
to understand future risks for novel 
infections. Further, populations may be 
infected or even re-infected via multiple 
travel routes. 

The number of people with confirmed 
COVID-19 has rapidly increased 
over the last five months with no sign 
of decline in the near future. The fight 
against COVID-19 will be long, until 
vaccines and other effective therapies are 
developed. To facilitate rapid therapeutic 
development, clinicopathological, 
genomic and other societal information 
must be shared with researchers, physicians 
and public health officials. Given 
the evolving nature of the SARS-CoV-2 
genome, drug and vaccine developers 
should continue to be vigilant for emergence 
of new variants or sub-strains of 
the virus. 


Summary (French version) 


Variant analysis of SARS-CoV-2 genomes 

Objective To analyse the genome variants of severe acute respiratory 
syndrome coronavirus 2 (SARS-CoV-2). 

Methods Between 1 February and 1 May 2020, we downloaded 
10 022 SARS-CoV-2 genomes from four databases. These 
genomes came from infected patients in 68 countries. We 
identified variants by pairwise alignment 
with the reference sequence NC_045512, using the EMBOSS 
Needle tool. Nucleotide variants in the coding regions were 
converted to the corresponding encoded amino acid residues. Finally, for 
clade analysis, we used the open source software 
Bayesian Evolutionary Analysis by Sampling Trees, version 2.5. 

Results We detected 5775 distinct genome variants, 
including 2969 missense mutations, 1965 synonymous mutations, 484 
mutations in the non-coding regions, 142 non-coding deletions, 
100 in-frame deletions, 66 non-coding insertions, 
36 stop-gained variants, 11 frameshift deletions 
and two in-frame insertions. The most common variants were the 
synonymous 3037C > T (6334 samples), P4715L in open reading frame 
1ab (6319 samples) and D614G in the spike protein (6294 
samples). We identified six major clades (basal, 
D614G, L84S, L3606F, D448del and G392D) and 14 subclades. Regarding 
base changes, the C > T mutation was the most common, with 
1670 distinct variants. 

Conclusion We found that there are many variants 
of the SARS-CoV-2 genome and that the D614G clade has become the 
most common variant since December 2019. The evolutionary analysis 
indicated structured transmission, with the possibility of multiple 
introductions into the population. 


Summary (Russian version) 


Variant analysis of SARS-CoV-2 genomes 

Objective To analyse the genome variants of severe acute 
respiratory syndrome coronavirus 2 (SARS-CoV-2). 

Methods Between 1 February and 1 May 2020, the authors 
downloaded data on 10 022 SARS-CoV-2 genomes from 
four databases. The genomes belonged to infected 
patients from 68 countries. The authors identified variants by 
extracting and comparing sequences pairwise with the reference 
genome NC_045512, using the EMBOSS toolkit. 
Nucleotide variants in the coding regions 
were converted to the corresponding encoded 
amino acid residues. For clade analysis, open source software 
for Bayesian evolutionary analysis by sampling trees, 
version 2.5, was used. 

Results In total, 5775 distinct genome variants were identified, 
including 2969 missense mutations, 1965 synonymous 
mutations, 484 mutations in the non-coding regions, 
142 non-coding deletions, 100 in-frame deletions [...] 
occurred in 1670 variants. 

[...] the most common variant is the D614G clade. 
[...] transmission of genetic data, with the possibility of multiple 
introductions into the population. 


Summary (Spanish version) 


Variant analysis of SARS-CoV-2 genomes 

Objective To analyse the genome variants of severe acute respiratory 
syndrome coronavirus 2 (SARS-CoV-2). 

Methods Between 1 February and 1 May 2020, 
10 022 SARS-CoV-2 genomes were recorded from four databases. The 
genomes were from infected patients in 68 countries. 
Variants were identified by extracting the pairwise alignment with the 
reference genome NC_045512, using EMBOSS Needle. Nucleotide 
variants in the coding regions were converted to the 
corresponding encoded amino acid residues. To analyse the 
clades, the open source software Bayesian 
evolutionary analysis by sampling trees, version 2.5, was used. 

Results In total, 5775 distinct genome variants were identified, 
including 2969 missense mutations, 1965 synonymous 
mutations, 484 mutations in the non-coding regions, 142 
non-coding deletions, 100 in-frame deletions, 66 non-coding 
insertions, 36 stop-gained variants, 11 
frameshift deletions and two in-frame insertions. The 
most common variants were the synonymous 3037C > T (6334 samples), 
P4715L in open reading frame 1ab (6319 samples) and D614G in the 
S protein (6294 samples). Six major clades were identified 
(basal, D614G, L84S, L3606F, D448del and G392D), with 14 subclades. 
Regarding base changes, the C > T mutation was the most common, 
with 1670 distinct variants. 

Conclusion There are many variants of the SARS-CoV-2 genome, 
and the D614G clade has been the most common variant 
since December 2019. The evolutionary analysis indicated structured 
transmission, with the possibility of multiple 
introductions into the population. 


7. Guan WJ, Ni ZY, Hu Y, Liang WH, Ou CQ, He JX, et al; China Medical 
Treatment Expert Group for Covid-19. Clinical characteristics of coronavirus 
disease 2019 in China. N Engl J Med. 2020 04 30;382(18):1708-20. doi: 
http://dx.doi.org/10.1056/NEJMoa2002032 PMID: 32109013 

8. Wu Z, McGoogan JM. Characteristics of and important lessons from 
the coronavirus disease 2019 (COVID-19) outbreak in China: summary 
of a report of 72 314 cases from the Chinese Center for Disease Control 
and Prevention. JAMA. 2020 Feb 24;323(13):1239-42. doi: 
http://dx.doi.org/10.1001/jama.2020.2648 PMID: 32091533 

9. Wang C, Horby PW, Hayden FG, Gao GF. A novel coronavirus outbreak of 
global health concern. Lancet. 2020 02 15;395(10223):470-3. doi: 
http://dx.doi.org/10.1016/S0140-6736(20)30185-9 PMID: 31986257 

10. Cumulative Number of Reported Probable Cases of SARS [internet]. 
Geneva: World Health Organization; 2020. 
https://www.who.int/csr/sars/country/2003_07_11/en/ [cited 2020 May 29]. 

11. Middle East respiratory syndrome coronavirus (MERS-CoV) [internet]. 
Geneva: World Health Organization; 2020. 
https://www.who.int/emergencies/mers-cov/en/ [cited 2020 May 29]. 

12. Richardson S, Hirsch JS, Narasimhan M, Crawford JM, McGinn T, Davidson 
KW, et al; and the Northwell COVID-19 Research Consortium. Presenting 
characteristics, comorbidities, and outcomes among 5700 patients 
hospitalized with COVID-19 in the New York city area. JAMA. 2020 Apr 22. 
Epub ahead of print. doi: http://dx.doi.org/10.1001/jama.2020.6775 PMID: 
32320003 


Bull World Health Organ 2020;98:495-504 | doi: http://dx.doi.org/10.2471/BLT.20.253591 




13. Chen N, Zhou M, Dong X, Qu J, Gong F, Han Y, et al. Epidemiological and 
clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in 
Wuhan, China: a descriptive study. Lancet. 2020 02 15;395(10223):507-13. 
doi: http://dx.doi.org/10.1016/S0140-6736(20)30211-7 PMID: 32007143 

14. Yang X, Yu Y, Xu J, Shu H, Xia J, Liu H, et al. Clinical course and outcomes 
of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a 
single-centered, retrospective, observational study. Lancet Respir Med. 2020 
05;8(5):475-81. doi: http://dx.doi.org/10.1016/S2213-2600(20)30079-5 
PMID: 32105632 

15. Sanjuán R, Domingo-Calap P. Mechanisms of viral mutation. Cell Mol Life 
Sci. 2016 12;73(23):4433-48. doi: 
http://dx.doi.org/10.1007/s00018-016-2299-6 PMID: 27392606 

16. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data 
- from vision to reality. Euro Surveill. 2017 03 30;22(13):30494. doi: 
http://dx.doi.org/10.2807/1560-7917.ES.2017.22.13.30494 PMID: 28382917 

17. Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, 
complete genome. NCBI Reference Sequence: NC_045512.2. Bethesda: 
National Center for Biotechnology Information; 2020. Available from: 
https://www.ncbi.nlm.nih.gov/nuccore/1798174254 [cited 2020 May 19]. 

18. Needleman SB, Wunsch CD. A general method applicable to the search for 
similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 
Mar;48(3):443-53. doi: http://dx.doi.org/10.1016/0022-2836(70)90057-4 
PMID: 5420325 

19. orf1ab polyprotein [Severe acute respiratory syndrome coronavirus 2]. 
NCBI Reference Sequence: YP_009724389.1. Bethesda: National Center for 
Biotechnology Information; 2020. Available from: 
https://www.ncbi.nlm.nih.gov/protein/1796318597 [cited 2020 May 29]. 

20. Koyama T, Platt D, Parida L. Variant analysis of SARS-CoV-2 genomes [data 
repository]. Meyrin: European Organization for Nuclear Research; 2020. doi: 
http://dx.doi.org/10.5281/zenodo.3840465 

21. Ward JH Jr. Hierarchical Grouping to Optimize an Objective Function. J Am 
Stat Assoc. 1963;58(301):236-44. doi: 
http://dx.doi.org/10.1080/01621459.1963.10500845 

22. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, 
et al; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific 
computing in Python. Nat Methods. 2020 03;17(3):261-72. doi: 
http://dx.doi.org/10.1038/s41592-019-0686-2 PMID: 32015543 

23. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment 
search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 
http://dx.doi.org/10.1016/S0022-2836(05)80360-2 PMID: 2231712 

24. Papadopoulos JS, Agarwala R. COBALT: constraint-based alignment tool for 
multiple protein sequences. Bioinformatics. 2007 May 1;23(9):1073-9. doi: 
http://dx.doi.org/10.1093/bioinformatics/btm076 PMID: 17332019 


25. Grifoni A, Sidney J, Zhang Y, Scheuermann RH, Peters B, Sette A. A sequence 
homology and bioinformatic approach can predict candidate targets for 
immune responses to SARS-CoV-2. Cell Host Microbe. 2020 04 8;27(4):671- 
680.e2. doi: http://dx.doi.org/10.1016/j.chom.2020.03.002 PMID: 32183941 

26. Liu Z, Xiao X, Wei X, Li J, Yang J, Tan H, et al. Composition and divergence 
of coronavirus spike proteins and host ACE2 receptors predict potential 
intermediate hosts of SARS-CoV-2. J Med Virol. 2020 Feb 26;92(6):595-601. 
doi: http://dx.doi.org/10.1002/jmv.25726 PMID: 32100877 

27. Arvestad L. alv: a console-based viewer for molecular sequence alignments. 
J Open Source Softw. 2018;3(31):955. doi: 
http://dx.doi.org/10.21105/joss.00955 

28. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment 
M, Gavryushkina A, et al. BEAST 2.5: An advanced software platform 
for Bayesian evolutionary analysis. PLOS Comput Biol. 2019 04 
8;15(4):e1006650. doi: http://dx.doi.org/10.1371/journal.pcbi.1006650 
PMID: 30958812 

29. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid 
multiple sequence alignment based on fast Fourier transform. Nucleic Acids 
Res. 2002 Jul 15;30(14):3059-66. doi: http://dx.doi.org/10.1093/nar/gkf436 
PMID: 12136088 

30. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a 
molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160-74. doi: 
http://dx.doi.org/10.1007/BF02101694 PMID: 3934395 

31. Lyons DM, Lauring AS. Evidence for the selective basis of 
transition-to-transversion substitution bias in two RNA viruses. Mol Biol 
Evol. 2017 Dec 1;34(12):3205-15. doi: 
http://dx.doi.org/10.1093/molbev/msx251 PMID: 29029187 

32. Li Z, Wu J, Deleo CJ. RNA damage and surveillance under oxidative 
stress. IUBMB Life. 2006 Oct;58(10):581-8. doi: 
http://dx.doi.org/10.1080/15216540600946456 PMID: 17050375 

33. Koyama T, Weeraratne D, Snowdon JL, Parida L. Emergence of drift 
variants that may affect COVID-19 vaccine development and antibody 
treatment. Pathogens. 2020 04 26;9(5):324. doi: 
http://dx.doi.org/10.3390/pathogens9050324 PMID: 32357545 

34. Ou J, Zhou Z, Dai R, Zhang J, Lan W, Zhao S, et al. Emergence of RBD 
mutations in circulating SARS-CoV-2 strains enhancing the structural 
stability and human ACE2 receptor affinity of the spike protein [preprint]. 
Cold Spring Harbor: medRxiv; 2020. doi: 
http://dx.doi.org/10.1101/2020.03.15.991844 

35. Zhao Z, Li H, Wu X, Zhong Y, Zhang K, Zhang YP, et al. Moderate mutation 
rate in the SARS coronavirus genome and its implications. BMC Evol Biol. 
2004 06 28;4(1):21. doi: http://dx.doi.org/10.1186/1471-2148-4-21 PMID: 
15222897 




